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Abstract 

A  framework  for  the  regularized  and  robust  estimation  of 
non-uniform  dimensionality  and  density  in  high  dimen¬ 
sional  noisy  data  is  introduced  in  this  work.  This  leads  to 
learning  stratifications,  that  is,  mixture  of  manifolds  repre¬ 
senting  different  characteristics  and  complexities  in  the  data 
set.  The  basic  idea  relies  on  modeling  the  high  dimensional 
sample  points  as  a  process  of  Translated  Poisson  mixtures, 
with  regularizing  restrictions,  leading  to  a  model  which  in¬ 
cludes  the  presence  of  noise.  The  Translated  Poisson  dis¬ 
tribution  is  useful  to  model  a  noisy  counting  process,  and 
it  is  derived  from  the  noise-induced  translation  of  a  regu¬ 
lar  Poisson  distribution.  By  maximizing  the  log-likelihood 
of  the  process  counting  the  points  falling  into  a  local  ball, 
we  estimate  the  local  dimension  and  density.  We  show  that 
the  sequence  of  all  possible  local  counting  in  a  point  cloud 
formed  by  samples  of  a  stratification  can  be  modeled  by  a 
mixture  of  different  Translated  Poisson  distributions,  thus 
allowing  the  presence  of  mixed  dimensionality  and  densi¬ 
ties  in  the  same  data  set.  With  this  statistical  model,  the  pa¬ 
rameters  which  best  describe  the  data,  estimated  via  expec¬ 
tation  maximization,  divide  the  points  in  different  classes 
according  to  both  dimensionality  and  density,  together  with 
an  estimation  of  these  quantities  for  each  class.  Theoretical 
asymptotic  results  for  the  model  are  presented  as  well.  The 
presentation  of  the  theoretical  framework  is  complemented 
with  artificial  and  real  examples  showing  the  importance  of 
regularized  stratification  learning  in  high  dimensional  data 
analysis  in  general  and  computer  vision  and  image  analysis 
in  particular. 

1  Introduction 

Recently,  there  has  been  significant  interest  in  analyzing  the 
intrinsic  structure  of  high  dimensional  data,  this  is  com¬ 


monly  known  as  manifold  learning ,  e.g.,  [4,  6,  9,  20,  23,  28, 
33].  Often,  points  that  live  in  a  high  dimensional  space  can 
be  parametrized  by  a  number  of  parameters  much  smaller 
than  the  ambient  dimension.  A  representation  (embedding) 
of  the  data  in  a  lower  dimensional  space  is  very  helpful  for 
analysis  and  computations  on  the  dataset. 

Most  of  the  works  on  manifold  learning  rely  on  the  hy¬ 
pothesis  that  all  the  points  under  analysis  are  samples  of  the 
same  manifold  and  thus  there  is  a  unique  intrinsic  dimen¬ 
sion.  However,  this  is  often  not  a  correct  assumption.  It  is 
likely  that,  for  example,  a  collection  of  image  portraits  of 
the  same  person  under  varying  pose  and  illumination,  lies 
on  a  manifold  defined  by  a  set  of  parameters  related  to  the 
variations  in  pose  and  illumination.  On  the  other  hand,  let 
us  consider  a  set  of  images  representing  scanned  digits.  It 
might  happen  that  the  images  representing  the  digit  4 1  ’  can 
be  described  with  a  different  number  of  parameters  than  the 
images  for  the  digit  ‘2.’  Videos  of  diverse  human  motions 
contain  the  same  complexity  variability.  In  these  cases,  it 
is  important  to  detect  that  there  are  different  complexities 
present  in  the  same  (noisy)  point  cloud  data.  This  is  the 
subject  of  this  work. 

This  problem,  clustering-by-dimensionality  and  stratifi¬ 
cation  learning ,  has  recently  been  explored  in  a  handful  of 
works.  Barbara  and  Chen,  [3],  proposed  a  hard  clustering 
technique  based  on  the  fractal  dimension  (box-counting). 
Starting  from  an  initial  clustering,  they  incrementally  add 
points  into  the  cluster  for  which  the  change  in  the  fractal  di¬ 
mension  after  adding  the  point  is  the  lowest.  They  also  find 
the  number  of  clusters  and  the  intrinsic  dimension  of  the  un¬ 
derlying  manifolds.  Gionis  et  al. ,  [13],  propose  a  two-step 
algorithm:  First,  they  estimate  the  local  correlation  dimen¬ 
sion  and  density  for  each  point;  then,  standard  clustering 
techniques  are  used  to  cluster  the  two-dimensional  repre¬ 
sentation  (dimension  +  density)  of  the  data.  Souvenir  and 
Pless,  [31],  use  an  Expectation  Maximization  (EM)  type  of 
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technique,  combined  with  weighted  geodesic  multidimen¬ 
sional  scaling  (weighted  ISOMAP  [33]).  The  weights  mea¬ 
sure  how  well  each  point  fits  the  underlying  manifold  de¬ 
fined  by  the  current  set  of  points  in  the  cluster.  After  clus¬ 
tering,  each  cluster  dimensionality  is  estimated  following 
[23].  Vidal  et  al.,  [18,  36],  cluster  linear  subspaces  with  an 
algebraic  geometric  method  based  on  polynomial  differen¬ 
tiation,  called  Generalized  PC  A  (GPCA),  which  also  finds 
the  number  of  linear  subspaces  and  their  intrinsic  dimen¬ 
sions.  An  algorithm  for  clustering  linear  manifolds  based  on 
lossy  coding  was  proposed  by  Ma  et  al.  [25] .  Goh  and  Vidal 
[14]  extend  [27]  to  cluster  a  union  of  J,  non- intersecting,  k- 
connected  nonlinear  manifolds.  It  is  done  with  the  vectors 
spanning  the  null  space  of  the  LLE  matrix  [28],  which  are  a 
linear  combination  of  the  membership  vectors  and  the  em¬ 
bedding  vectors  of  the  J  connected  components.  The  work 
of  Mordohai  and  Medioni,  [26],  estimates  the  local  dimen¬ 
sion  using  tensor  voting.  Cao  and  Haralick,  [7],  propose  a 
hard  clustering  by  dimensionality:  First,  local  dimensional¬ 
ity  is  computed  via  local  PCA;  and  then,  neighboring  points 
are  clustered  together  if  they  have  the  same  dimension  and 
if  the  error  of  representing  the  new  cluster  as  a  combination 
of  basis  functions  in  a  kernel-based  feature  space  is  small. 
Among  these  clustering-by-dimensionality  techniques,  only 
the  one  by  Cao  and  Haralick  includes  spatial  information  in 
order  to  obtain  a  regularized  classification.  Recently,  Lu 
and  Vidal,  [24],  combined  GPCA  with  an  additional  spa¬ 
tial  constraint  in  a  k- means  fashion.  They  showed  that, 
by  adding  this  constraint,  the  classification  is  improved  in 
the  intersection  of  the  linear  subspaces.  From  the  computa¬ 
tional  geometry  perspective,  a  Voronoi-based  technique  to 
compute  local  dimensionality  has  been  introduced  in  [11], 
and  demonstrated  for  3D  point  cloud  data.  The  diffusion 
distance  framework,  [8,  22],  can  work  with  stratifications, 
though  no  explicit  estimation  of  the  clusters  is  performed 
and  single  maps  into  Euclidean  space  are  performed  for  the 
whole  data  set.  Recently,  and  following  in  part  the  theory 
of  persistent  topology  [12],  a  framework  for  studying  stratas 
based  on  local  homology  has  been  introduced  in  [5]. 

These  recent  works  have  clearly  shown  the  necessity  to 
go  beyond  manifold  learning,  into  “stratification  learning.” 
In  our  work,  we  do  not  assume  linear  subspaces,  and  we 
simultaneously  estimate  the  soft  clustering  and  the  intrin¬ 
sic  dimension  and  density  of  the  clusters  while  being  ro¬ 
bust  to  noise  and  outliers.  This  collection  of  attributes  is 
not  shared  by  any  of  the  pioneering  works  just  described. 
Our  approach  is  an  extension  of  the  Eevina  and  Bickel’s 
local  dimension  estimator  [23].  They  proposed  to  com¬ 
pute  the  intrinsic  dimension  at  each  point  using  a  Maxi¬ 
mum  Eikelihood  (ME)  estimator  based  on  a  Poisson  dis¬ 
tribution.  We  propose  to  compute  a  ML  on  the  whole  point 
cloud  data  at  the  same  time  (and  not  one  for  each  point  in¬ 
dependently),  based  on  a  Translated  Poisson  mixture  model, 


which  models  the  presence  of  noise  and  permits  to  have  dif¬ 
ferent  classes  (each  one  with  their  own  dimension  and  sam¬ 
pling  density).  This  technique  automatically  gives  a  soft 
clustering  according  to  dimensionality  and  density,  with  an 
estimation  of  both  quantities  for  each  class.  A  preliminary 
version  of  this  work  was  presented  in  [15]  and  a  regularized 
version  together  with  asymptotic  results  in  [16] .  These  tech¬ 
niques  are  particular  cases  of  the  more  general  Translated 
Poisson  model  introduced  in  this  paper  in  order  to  handle 
noise.1 

The  remainder  of  this  paper  is  organized  as  follows:  In 
Section  2  we  review  the  method  proposed  by  Levina  and 
Bickel,  [23],  which  gives  a  local  estimation  of  the  intrin¬ 
sic  dimension  and  has  inspired  our  work.  We  reformulate 
this  approach  in  Section  3  in  order  to  include  the  presence 
of  noise  in  the  statistical  model.  Section  4  explains  our  ap¬ 
proach  for  robust  stratification  learning.  We  show  experi¬ 
ments  with  synthetic  and  real  data  in  Section  5,  including 
comparisons  with  critical  literature,  and  finally,  conclusions 
are  presented  in  Section  6. 

2  Local  intrinsic  dimension  estima¬ 
tion 

Levina  and  Bickel,  [23],  proposed  a  geometric  and  prob¬ 
abilistic  method  which  estimates  the  local  dimension  and 
density  of  a  point  cloud  data.  This  dimension  estimator 
is  equivalent  to  the  one  proposed  in  [32]  in  the  context 
of  dynamical  systems.  Their  approach  is  based  on  the 
idea  that  if  we  sample  an  m- dimensional  manifold  with  T 
points,  the  proportion  of  points  that  fall  into  a  ball  around 
a  point  xt  is  ^  «  p(xt)V(m)Rk(xt)rn .  The  given  point 
cloud,  embedded  in  high  dimensions  D,  is  X  =  {xt  G 
t  =  1, . . . ,  T},  k  is  the  number  of  points  inside  the 
ball,  p(xt)  is  the  local  sampling  density  at  point  xt ,  V (m)  is 
the  volume  of  the  unit  sphere  in  Mm,  and  Rk{%t)  is  the  Eu¬ 
clidean  distance  from  xt  to  its  k- th  nearest  neighbor  (kNN). 
Then,  they  consider  the  inhomogeneous  process  N(R,  xt ), 
which  counts  the  number  of  points  falling  into  a  small  D- 
dimensional  sphere  B(R,xt)  of  radius  R  centered  at  xt. 
This  is  a  binomial  process,  and  some  assumptions  need 
to  be  done  to  proceed.  First,  if  T  — ►  oo,  k  — ►  oo,  and 
k/T  — >  0,  then  we  can  approximate  the  binomial  process 
by  a  Poisson  process.  Second,  the  density  p(xt)  is  consid¬ 
ered  constant  inside  the  sphere,  a  valid  assumption  for  small 

1  We  should  mention  that  in  [15]  we  compared  the  original  framework 
(with  no  regularization  or  noise  modelling  as  here  developed),  with  a  two 
step  approach,  where  we  first  estimate  the  local  dimensionality  per  point 
using  the  original  Levina-Bickel  approach,  and  then  cluster  following  the 
information  bottleneck  approach  [34].  This  has  been  shown  not  only  to 
be  less  elegant  and  mathematically  funded  than  the  approach  here  pre¬ 
sented,  but  mush  less  robust,  even  when  compared  to  the  non-regularized 
and  noise-transparent  formulation. 
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R.  Note  that  the  latter  assumption  is  only  local,  the  global 
density  does  not  need  to  be  constant,  only  inside  the  local 
sphere.  With  these  assumptions,  the  rate  A  of  the  counting 
process  N(R,  xt)  can  be  written  as 

X(R,xt)  =p(xt)V(m)mRm-1.  (1) 

The  log-likelihood  of  the  process  N(R,  xt)  is  then  given  by 


according  to  a  conditional  probability  density  f(z\x),  called 
the  transition  density. 

For  our  purposes,  we  are  going  to  consider  the  particular 
case  where  each  point  is  translated  independently  of  the  oth¬ 
ers  and  there  are  no  deletions  or  insertions  in  the  translation 
process  (these  more  general  cases  are  also  studied  in  [30]). 
We  have  the  following  critical  theorem  [30]  which  says  that 
a  translated  Poisson  process  is  also  a  Poisson  process: 


R  pR 

log  A (r,xt)dN(r,xt)~  /  A (r,xt)dr, 
J  o 

where  0(xt)  :=  log  p(xt)  is  the  density  parameter  and  the 
first  integral  is  a  Riemann-Stieltjes  integral  [29].  The  max¬ 
imum  likelihood  estimators  lead  to  a  computation  for  the 
local  dimension  at  point  xt,  Rn{pct),  depending  on  all  the 
neighbors  within  a  distance  R  from  xt  [23] .  In  practice,  it 
is  more  convenient  to  compute  a  fixed  amount  k  of  nearest 
neighbors.  Thus,  the  local  estimators  at  point  xt  are 


L(m(xt),0(xt))  =  [ 
Jo 


m(xt)  ■■ 


—  Y  logLAll 
k -l  Y  gRj{xt) 

3= 1  J 


(2) 


9{xt)=  log  ((fc  -  1)/  (v(m{xt))Rk{xt)m^y)  ,  (3) 


where  V(m(xt ))  =  (2nm^Xt^2) / (m(xt)T(m^ )),  and 
T ( m(xt) )  =  J^00  trn^Xt>)l2~1e~tdt.  If  the  data  points  be¬ 
long  to  the  same  manifold,  the  authors  propose  to  average 
over  all  local  estimators  m(xt)  in  order  to  obtain  a  more  ro¬ 
bust  estimator.  However,  if  there  are  two  or  more  manifolds 
with  different  dimensions,  the  average  does  not  make  sense, 
unless  we  first  cluster  according  to  dimensionality  and  then 
estimate  the  dimensionality  for  each  cluster.  Another  possi¬ 
bility  is  to  include  this  in  the  process  via  the  simultaneous 
soft  clustering  and  estimation  technique  described  in  Sec¬ 
tion  4.  Before  this,  let  us  present  the  proposed  framework 
to  naturally  handle  noise  as  part  of  the  model. 


3  Translated  Poisson  model 

Usually,  point  samples  are  contaminated  with  noise,  thus 
the  point  process  that  we  observe  is  not  a  simple  sampling 
of  a  low  dimensional  manifold  but  a  perturbation  of  this 
sample  process.  This  can  be  modeled  with  a  Translated 
Poisson  Process  [30],  where  an  underlying  (unobservable) 
point  process  is  translated  to  an  output  (observable)  point 
process.  The  input  and  output  spaces  of  the  points  are  not 
necessarily  the  same  or  even  of  the  same  dimension  (clearly, 
noise  brings  points  outside  of  the  underlying  manifold  and 
into  the  higher  dimensional  embedding  space).  More  con¬ 
cretely,  an  input  point  at  location  x  in  the  input  space  X 
is  randomly  translated  to  a  location  z  in  the  output  space  Z, 


Theorem  (Snyder  &  Miller  [30]).  Let  {N(A):  A  C  X} 
be  a  Poisson  process  with  an  integrable  intensity  function 
{A(x):  x  G  X}.  Points  of  this  input  point  process  are  trans¬ 
lated  to  the  output  space  Z  to  form  the  output  point  process 
{M(B):  B  C  Z},  where  each  point  is  independently  trans¬ 
lated  according  to  the  transition  density  f(z \x).  Then,  if 
there  are  no  insertions  and  deletions,  {M(B):  B  C  Z}  is  a 
Poisson  process  with  intensity 


f{z\x)\(x)dx. 


Since  the  intensity  of  the  Poisson  process  in  our  model 
is  parametrized  by  the  Euclidean  distances  of  the  points 
(and  not  by  the  points  themselves,  see  previous  Section),  we 
are  going  to  consider  a  random  translation  in  the  distances. 
This  means  that  we  do  not  observe  the  original  distances  but 
noisy  distances.  Let  f(s\r)  be  the  transition  density  which 
defines  the  random  process  which  translates  a  distance  r  in 
the  input  space  to  a  distance  s  in  the  observable  space.  If 
A  (r,  xt ),  defined  in  (1),  is  the  local  rate  of  the  Poisson  pro¬ 
cess  which  defines  the  counting  process  in  the  input  space, 
then  /i(s),  the  intensity  of  the  Poisson  process  in  the  output 
space  is  given  by 


r! 

f(s\r)eeV  (ra)rarm_1 


dr. 


(4) 


R'  is  different  from  the  radius  R  considered  in  the  counting 
process  N(R,xt).  We  consider  R'  >  R  in  (4)  because, 
points  originally  at  distance  greater  than  R  from  xt  can  be 
placed  within  a  distance  less  than  R  after  the  translation 
process.  In  practice,  the  maximum  translation  is  small  (just 
a  perturbation  because  of  the  noise)  and  we  consider  R'  = 
R  +  a  in  the  particular  case  of  a  Gaussian  transition  density 
(11).  The  log-likelihood  of  the  translated  Poisson  process  is 


pR  pR 

L(m(xt),0(xt))  =/log  (n(s,xt))dN(s,xt)-  X  (r,xt)dr. 


The  parameters  of  the  maximum  log-likelihood  are  ob¬ 
tained  by  solving  the  system  of  equations  dL / dm  =  0  and 
dL/dO  =  0.  We  then  obtain  the  following  expression  for  m 
when  we  use  the  k  nearest  neighbors  (fc-NN)  instead  of  the 
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points  within  distance  less  to  R , 


to  the  transition  density  f(Ri  \r)  and  thus  reducing  the  effect 
of  noise.  Using  the  approximation  (9)  in  (6)  we  obtain 


m(xt)= 


1  f(Rj(xt)\r)rm  1  log  dr 

:  “  1  i=i  jf  f(-Rt(xt)lr)rm~1dr 


(5) 


where,  by  an  abuse  of  notation,  we  have  identified  m  = 
m(xt )  in  the  right  hand  side.  Note  that  this  expression  re¬ 
duces  to  the  Levina  and  Bickel  estimator  [23]  in  the  particu¬ 
lar  case  that  f(s\r)  =  S(s  —  r),  i.e.,  there  is  no  translation  of 
the  original  points.  This  corresponds  to  the  ideal  case  with 
no  noise. 

Equation  (5)  is  a  nonlinear  recursive  expression  in  m 
which  is  difficult  to  solve.  Thus,  we  are  going  to  approxi¬ 
mate  it  by  an  easier  to  compute  closed  expression.  Since  the 
translation  density  is  modeling  the  effect  of  noise,  the  effec¬ 
tive  support  of  f(s\r)  is  going  to  be  concentrated  around  s. 
Then,  we  can  substitute  rm_1  in  (5)  by  its  Taylor  expansion 
around  Ri.  Let  us  write  (5)  in  the  following  way 


m(xt) 


1  U  f0R  /(-Ri|r)l°g^rfr 

~1i= l  fc f  f(Ri\r)dr 


(10) 


We  explicitly  estimate,  in  the  following  Section,  the  error 
produced  in  m(xt)  when  we  use  the  approximation  (10)  in¬ 
stead  of  (5),  for  the  particular  important  case  of  a  Gaussian 
transition  density, 


f(s\r) 


, —  exp 
\/27r(7 


(11) 


In  this  particular  case  that  the  coordinates  are  perturbed  by 
Gaussian  noise,  the  error  in  the  Euclidean  distance  can  be 
approximated  by  a  Gaussian  as  well  (see  Appendix  A  for 
more  details).  Thus,  the  expression  for  the  local  dimension 
estimator  becomes 


m{xt)  =  I  1 


1 

k-  1 


k- 1 


-1 


Z* 


(6) 


and  expand  rm  1  in  the  integral  Ii  via  its  Taylor  series 


m(xt) 


1  ^f0R  exp  R{-\^-)  log  ^ dr 

k  ~  1  <=i  foR'  exP  (“  (fl2^r)2)  dr 

(12) 


I  _  /(f  /(-R»|r)rm  1  log  Rk^Ct'>  dr 
fR  f(-Ri\r)rm~1dr 

=  So  f(Rj\r)  log  dr  +  AJjy,  +  . . .  = 

foR'  f(Ri\r)dr  +  AIDi+...  ~  Id,  ’ 

where 

A INi  :=  (m-l)R~l  [  f(Ri\r)(r  -  Ri)  log  SS±SlOA  dr, 
Jo  r 

(7) 

and 

rRf 

AIDi  :=  (m  -  1)^  /  f(Ri\r)(r  -  Ri)dr.  (8) 

Jo 

These  integrals  are  small  since  the  effective  support  of 
f(Ri\r)  has  the  same  order  than  the  level  of  noise  (con¬ 
sidered  not  very  large),  and  the  quantity  (r  —  Ri)  is  small  in 
the  vicinity  of  Ri .  We  can  then  approximate 

L  _  1  /(-Rj|r)log^A)dr 
f0R'  f(Ri\r)dr 

Notice  that  with  this  approximation  of  R,  the  estimator  (6) 
still  reduces  to  the  noise-free  Levina-Bickel  estimator  (2), 
that  is  R  =  log  when  f(Ri\r)  =  6(Ri~r).  In  the  more 
general  case,  (9)  is  the  expected  value  of  log  ^  according 


3.1  Approximation  error  for  a  Gaussian 
translation  density 

In  order  to  estimate  the  error  of  approximating  (5)  by  (10), 
we  compute  the  integrals  (7)  and  (8),  which  are  the  largest 
order  error  terms  of  the  numerator  and  denominator,  respec¬ 
tively,  in  the  approximation  of  m(xt).  For  the  integral  (8), 
notice  that  the  Gaussian  is  even  with  respect  to  Ri  and  that 
(r  —  Ri)  is  odd.  Then,  (8)  is  zero  if  the  effective  support  of 
the  Gaussian  is  within  the  interval  [0,  R'],  that  is  essentially 
if  Ri  e  [3a,  R'  -  3 cr].  If  Ri  G  [0,3 cr]  U  [Rf  -  3cr,  R'},  (8) 
is  bounded  by  4.5cr2  (m  —  1  )/Ri.  We  will  use  this  bound 
for  A  Id.  independently  of  the  value  of  Ri.  Regarding  the 
integral  (7),  we  use  the  Taylor  expansion  of  (r  —  Ri)  log  ^ 
around  Ri , 


(  dm  Rk  (  dm  Rk  (r  -  Ri)2 

(r  -  Ri)  log  —  =  (r  -  Ri)  log  — - - - 

T  lLi  lLi 


+  .... 


Again,  we  consider  the  worst  case  scenario,  Ri  G  [0,  3a]  U 
[Rf  —  3a,  R'],  and  we  obtain 


A  Tv,  <  4.5cr2m  log 

ihi  rii 

We  use  these  bounds  and  error  propagation  theory  to  obtain 
the  relative  error  on  R, 


A R  A/^  A IDi  2m  -  1  f  1  Rk  1 

R  INi  IDi  Ri  Wi  Ri  Id, 
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and  the  relative  error  on  mk(xt), 

A  m(xt)  =  A J  =  1  v  ^  a  j 

m(xt)  ~  I  ~  I(k-  1)  Y 

which  is  bounded  by 

Am(xt)  <  4.5cr2(m(a;t)  - 1)  /  m(xt)  \ 

m(xd  “  min,  V  ™(xt,cr  =  0) )  ’ 

(13) 

where  m(xt,  cr  =  0)  is  (2),  or  equivalently,  (5)  with  <7  =  0, 

and  1  =  Ioi  =  f*  /(i?i|r)rm_1dr.  This  provides 
a  bound  on  the  error  of  the  approximation  for  the  important 
case  of  Gaussian  noise.  Similar  computations  can  be  per¬ 
formed  for  other  translation  density  (noise  models).  In  the 
case  of  cr  =  0  (no  noise),  the  approximation  error  A m(xt) 
is  zero,  as  expected.  If  we  consider  R,  ~  Ri ,  the  bound 
(13)  is  inversely  proportional  to  the  signal  to  noise  ratio  and 
proportional  to  (m  —  1)/ P™-2 ,  which  is  a  decreasing  func¬ 
tion  of  the  dimension  m  for  Ri  >  1.  Note  that  the  estimator 
m(xt),  defined  in  (5),  is  invariant  to  distance  rescalings  so 
we  can  always  ensure  Ri  >  1. 

4  Dimensionality  and  density  estima¬ 
tion  with  simultaneous  soft  cluster¬ 
ing 

Having  introduced  the  critical  translational  Poisson  model, 
we  are  now  ready  to  introduce  the  mixture  of  these  models 
to  address  the  problem  of  stratification  learning  for  noisy 
point  cloud  data.  We  start  with  the  basic  model,  and  then 
introduce  a  regularization  term.  We  conclude  the  presenta¬ 
tion  providing  asymptotic  results. 

4.1  Translation  Poisson  Mixture  Model 
(TPMM) 

In  [15],  we  proposed  to  study  a  stratification  by  extend¬ 
ing  the  Levina  and  Bickel’s  technique.  Instead  of  model¬ 
ing  each  point  and  its  local  ball  of  radius  R  as  a  Poisson 
process  and  computing  the  maximum  likelihood  (ML)  for 
each  ball  separately,  all  the  possible  balls  are  considered  at 
the  same  time  in  the  ML  function.  The  probability  density 
function  for  the  whole  point  cloud  becomes  a  mixture  of 
Poisson  distributions  with  different  parameters  (dimension 
and  density)  in  each  class.  This  allows  for  the  presence  of 
different  intrinsic  dimensions  and  densities  in  the  dataset. 
These  are  automatically  computed  while  being  used  for  soft 
clustering.  We  extend  this  approach  here  to  the  more  gen¬ 
eral  case  when  we  have  mixtures  of  translated  Poisson  pro¬ 
cesses  (thereby  handling  the  noise). 


Let  us  consider  J  different  translated  Poisson  distribu¬ 
tions  in  the  mixture,  each  one  with  a  (possibly)  different 
dimension  m  and  density  parameter  0.  Let  us  denote  by  ip 
the  vector  set  of  parameters,  ip  =  {ipi  =  (7rJ ,  ,  m? ) ;  j  = 

1, . . . ,  J},  where  i R  is  the  mixture  coefficient  for  class  j 
(the  proportion  of  distribution  j  in  the  dataset),  0 J  is  its  den¬ 
sity  parameter  (pJ  =  ee° ),  and  m?  is  its  dimension.  While 
in  the  Levina-Bickel  approach  the  density  is  assumed  lo¬ 
cally  constant  (inside  a  ball)  here  the  density  is  assumed 
constant  inside  a  class  (a  single  Poisson  distribution  defines 
each  class).  However,  if  there  is  a  class  with  different  densi¬ 
ties  the  algorithm  will  cluster  also  according  to  density  (not 
only  dimension)  and  a  single  manifold  will  be  represented 
by  clusters  of  the  same  dimension  but  different  densities. 
An  example  of  that  is  shown  in  Figure  5.  If  the  number  of 
classes  is  not  sufficient  to  represent  the  dimension  and  den¬ 
sity  variability,  the  algorithm  will  give  one  or  more  classes 
with  a  dimension  and/or  density  which  are  the  (weighted) 
average  of  the  actual  features  within  the  class.  This  is  the 
standard  result  for  under-clustering.  On  the  other  hand, 
we  have  experimentally  observed  that  giving  extra  classes 
is  reasonable  robust,  since  the  extra  classes  end-up  being 
empty  or  identical  to  other  classes. 

The  observable  event  is,  as  in  the  Levina-Bickel  ap¬ 
proach,  the  number  of  points  inside  the  ball  B(R,xt)  of 
radius  R  centered  at  point  xt ,  denoted  by  yt  =  N(R ,  xt). 
The  total  number  of  observations  is  T'  and  Y  =  {yt;t  = 
1, . . . ,  T'}  is  the  observation  sequence.  Often,  T'  =  T,  all 
points  in  the  dataset  are  considered.  Let  us  also  denote  by 
p(-)  the  probability  density  function  and  by  P(-)  the  proba¬ 
bility.  The  density  function  of  the  Poisson  mixture  model  is 
given  by 

j 

p{yt \rp)  = 

i=i 

Since  the  observations  follow  a  Poisson  distribution,  and  we 
use  the  translated  Poisson  model  introduced  in  the  previous 
section,  we  have 

p(yt\6\mj)  =  eIoRl°gvRs)dN(s,Xt)e-f0RARr)dr 5 

where  A J(r)  =  e°JV(mJ)mJrTnJ~1  and  juJ(s)  = 

f0R  f{s\r)ee3V{m^)m^rrn3~1dr.  If  Y  contains  T  sta¬ 
tistically  independent  variables  (a  standard  assumption), 
then  the  probability  density  function  of  the  observation  se¬ 
quence  is  the  product  of  the  individual  probability  densities, 
p(yt  |  and  the  log-likelihood  is 

T 

l(y\ ip)  =  iogp(r|v>)  =  EiogKmlVO.  (14) 

t=  1 

Let  us  consider  the  hidden-state  information,  that  is,  which 
mixture  (or  expert)  generates  each  observation.  We  denote 
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by  Z  =  {zt  E  C\  t  =  1, . . . ,  T}  the  set  of  hidden  variables 
and  by  C  =  {C1,  C2, . . .  CJ}  the  set  of  class  labels.  Then, 
zt  =  means  that  the  j-th  mixture  generates  yt.  Using  Z 
we  can  write  the  complete  data  log-likelihood  as 

T  J 

iogP(z,y|V0  =  EE**  log  ,  (15) 

t=i  i=i 


to  compute  mJn+1  we  have  used  the  same  approach  as  in 
[23],  by  means  of  a  fc  nearest  neighbor  graph.  The  TPMM 
approach  just  described  is  summarized  in  R-TPMM  Algo¬ 
rithm  below,  for  the  particular  case  of  a  =  0  (no  regular¬ 
ization,  see  below). 

4.2  Regularized  TPMM 


where  a  set  of  indicator  variables  5Jt,  called  membership 
functions,  is  used  in  order  to  indicate  the  status  of  the  hid¬ 
den  variables: 


6{  =  6(zt,Cj) 


1  if  zt  =  Cj , 

0  otherwise. 


The  unknown  parameters  in  (15)  are:  The  membership 
function  of  an  expert  (class),  5Jt,  the  mixture  probabilities, 
7 rJ,  and  the  parameters  of  each  expert,  and  Qi .  Usually, 

problems  involving  a  mixture  of  experts  are  solved  by  the 
Expectation  Maximization  (EM)  algorithm  [10]  [21,  Chap. 
3].  The  EM  is  based  on  the  following  decomposition  of  the 
log-likelihood  (14): 


T  J 

l{y\i[>,  h)  =  log  [piytW)^3] 

7  7  (16) 

“EEywios  [h3(vt)\ , 

t= i  j= i 

where  if  =  {hj(yt)  <  1  ;t  =  1  =  1 

and  hi  (yt)  is  the  probability  that  observation  t  belongs  to 
mixture  j:  hj(yt)  =  Ez[63t\yt,ip]  =  P{S3t  =  1| yt,ip), 
where  Ez(-)  is  the  expectation  with  respect  to  Z.  Since  the 
membership  functions  are  indicator  variables,  the  first  term 
in  (16)  is  the  expectation  of  (15)  with  respect  to  Z.  Also 
notice  that  the  second  term  is  the  entropy  of  the  membership 
functions. 

An  interesting  interpretation  of  the  EM  algorithm  is 
introduced  in  [17],  where  the  EM  is  seen  as  an  alter¬ 
nate  optimization  algorithm  of  the  log-likelihood  (16). 
Then,  the  E-step  is  nothing  else  than  the  maximization  of 
L(Y\7>i  H)  with  respect  to  H  with  the  additional  constraint 
that  J7j=i  W(yt)  =  1  f°r  each  observation  t  =  1, . . . ,  T. 
Thus,  the  variables  hi  (yt)  at  step  n  +  1  of  the  optimization 
algorithm  are 


K+M 


E=iP(yt\mln^hyn 


(17) 


The  TPMM  algorithm  seeks  a  soft  clustering  according  to 
dimensionality  and  density,  considering  noise  in  the  data, 
but  does  not  (explicitly)  take  into  account  spatial  informa¬ 
tion.  Adding  regularization  is  the  goal  of  this  section.  Regu¬ 
larization  further  helps  to  improve  the  classification  in  noisy 
data  and  points  lying  close  to  manifold  edges  (see  results  in 
figures  1  and  2).  This  regularization  is  inspired  in  part  by 
the  work  in  [1]  for  the  neighborhood  EM  (NEM),  where  the 
authors  extend  the  EM  algorithm  adding  spatial  constraints. 
This  neighborhood  spatial  information  is  introduced  as  a  pe¬ 
nalization  term  in  the  log-likelihood,  following  Hathaway’s 
EM  interpretation  [17].  In  our  context,  we  complete  (16) 
with  a  spatial  term  S(H ), 

m  H)  =  L(Y\ ^  H)  +  aS(H),  (18) 

where  a  is  a  parameter  that  controls  the  tradeoff  between 
the  spatial  term  and  the  likelihood.  Its  value  is  also  re¬ 
lated  to  the  amount  of  noise  in  the  data.2  Then,  function 
F  is  maximized  with  an  alternate  optimization  technique. 
Since  the  new  term,  S,  only  depends  on  H,  the  optimiza¬ 
tion  procedure  results  in  a  EM-type  algorithm  with  a  mod¬ 
ified  membership  probability  that  not  only  depends  on  the 
likelihood  but  also  on  the  spatial  criteria.  The  NEM  algo¬ 
rithm  uses  (note  the  similitude  with  MRFs,  see  below) 

T  J 

Snem(H)  =  EE  h3  (yt)E/l'J  (2//)> 

t= 1  j=l  l~t 

where  l  ~  t  indicates  that  there  is  a  neighborhood  relation¬ 
ship  between  observations  l  and  t.  By  maximizing  this  term, 
we  want,  for  each  observation  t ,  as  many  neighbors  as  pos¬ 
sible  with  high  probability  of  belonging  to  the  same  class  as 
observation  t,  thus  regularizing  the  classification.  However, 
we  will  use  a  more  general  expression  for  S(H)  based  on 
a  dissimilarity  measure,  V ,  between  every  observation  and 
other  observations  in  the  sequence, 

T  J 

S(H)^-^2^2h3(yt)V(t,j,X,H).  (19) 

t=  1  j  =  l 


In  the  same  way,  variables  ^  are  obtained  by  maximizing 
L(Y\7>,  H)  with  respect  to  7  with  an  additional  constraint 
for  the  mixture  probabilities:  J2j= l  ^=1.  This  gives  equa¬ 
tions  (21)-(23)  for  the  variables  at  step  n  +  1.  In  order 


2The  study  of  the  possible  connection  between  the  regularization  fac¬ 
tor  a  and  the  level  of  noise  and  the  translation  density  in  the  translation 
Poisson  model  is  an  interesting  subject  of  future  research.  Note  that  this 
regularization  is  important  beyond  the  noise,  e.g.,  at  manifolds  edges,  see 
experimental  results. 
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The  expression  (19)  provides  a  generic  framework  for  intro¬ 
ducing  constraints  in  the  soft  classification,  besides  the  ones 
already  present  in  the  TPMM  model,  namely  dimensional¬ 
ity  and  density.  One  possibility,  as  in  the  NEM  algorithm,  is 
to  introduce  spatial  regularity.  Then,  as  dissimilarity  mea¬ 
sure  we  use  V  =  Vr  defined  as 


©*“£(!  -h’(Vl)). 


Different  neighborhoods  definitions  in  Vr  result  in  differ¬ 
ent  kinds  of  regularization.  A  natural  choice  is  the  man¬ 
ifold  neighborhood,  for  that,  we  can  define  as  neighbors 
the  k  nearest  neighbors.  However,  for  specific  applications 
one  might  be  interested  in  other  neighborhoods,  e.g.,  pixel 
neighborhoods  or  contiguous  frames  in  video  applications 
(see  experiment  in  Figure  10  and  Table  5). 

As  noted  in  [1],  the  EM  algorithm  with  additional  con¬ 
straints  can  be  seen  as  finding  the  Gibbs  distribution  with 
energy  — F(0,i7).  In  the  particular  case  when  the  ad¬ 
ditional  constraint  is  neighborhood  dependent,  Snem(H ) 
and  S(H)  with  Vr ,  the  Gibbs  distribution  defines  a  Markov 
Random  Field. 

The  maximization  of  F  (Equation  (18)),  is  obtained  as 
in  [1],  with  an  alternate  optimization  technique  which  re¬ 
sults  in  an  EM-type  algorithm.  Maximizing  (18)  with  re¬ 
spect  to  H ,  with  S(H)  defined  in  (19)  -  with  the  constraints 
J2j= i  ^  (yt)  =  1  for  each  observation  t  =  1, . . . ,  T,  by 
means  of  Lagrange  multipliers  -  results  in  the  following  ex¬ 
pression  for  the  membership  probabilities: 


h3(Vt)  = 


p(y  i 


<t\m 


3,8j)Ke~aV'(t’:i’x’H) 


'Yld=\P{yt\ml 1 0l)nle  oiV(t,i,x,H) 


(20) 


where  V'(t,l,X,H)  =  1  —  2 hj(yi))  in  the  par¬ 

ticular  case  we  are  interested:  V  =  Vr,  and  assuming 
that  l  ~  t  implies  t  ~  1.  Since  the  only  term  in  (18) 
which  depends  on  0  is  L(Y\i/j,H),  the  optimal  values  of 
'ipi  =  { (7rJ ,  Qi ,  )  for  j  =  {1, . . . ,  J}}  do  not  change 

with  respect  to  the  original  TPMM  algorithm.  The  regular¬ 
ized  version  of  the  TPMM  algorithm  is  summarized  in  the 
R-TPMM  Algorithm  below  (Regularized  Translated  Pois¬ 
son  Mixture  Model  Algorithm). 

The  EM  suffers  from  local  maxima,  this  can  be  allevi¬ 
ated  running  the  algorithm  several  times  with  different  ini¬ 
tializations.  In  particular,  we  add  to  the  EM  iterations  an 
extra  loop  where  the  parameters  raJ  and  of  each  class 
are  reinitialized  every  odd  iteration  and  i rJ  every  even  iter¬ 
ation. 


R-TPMM  Algorithm 


REQUIRE:  The  point  cloud  data,  J  (number  of  desired  classes),  k 
(scale  of  observation),  a  (regularization  parameter),  and  a  (noise 


level  or  full  noise/translation  function  /). 

ENSURE:  Regularized  soft  clustering  according  to  dimensionality 
and  density. 

1.  Compute  the  local  estimators 


m(xt)  = 


k  —  1  rR' 


3  =  1 


Jo  f(Ri(xt)\r)  log  -*fU-dr 
C  f(Ri(xt)\r)dr 


6{xt)  =  log  {(k  -  1)/  (v(m(xt))Rk(xt)m('xt^ 


In  particular,  we  use  the  definition  of  /  given  in  (1 1). 

2.  Initialize  0o  =  {7rQ,mJ0,  6j0}  and  0o  =  {ttq,  fhJ0 ,  0Jo}  to  any 

set  of  values  which  ensures  that  JA  —  1  and 

Ho  =  {i K(yt )  =  1/ J;  j  =  1, . . . ,  J,  t  =  1, . . . ,  T}. 

3.  Iterations  on  l. 


3A.  If  l  is  odd 

Set  fhi  =  mJ0  and  0j  —  Oq,  for  all  j  =  1, ... ,  J. 
Else 

Set  7 if  =  1/ J,  for  all  j  =  1, . . . ,  J. 

3B.  Iterations  on  n. 

For  all  j  =  1, . . . ,  J: 

3B.1:  Compute,  for  alH  =  1, . . . ,  T, 


K+i  (yt) 


3, X,Hn) 

EiLi  p(yt\mln,  eiwne-«v^x,Hn)  ’ 


where  Hn  =  {h3n(yt)\  j  =  1, . . . ,  J,  t  =  1, . . .  ,T}. 

3B.2:  Compute 


rt+i  =  Tp'YjKiyt 


(21) 


°n+ 1~ 


Jn(yt)m(xt)  1/'Y^K{yt) 


(22) 


P3n+1=e™+' 


^2hi(yt)f(xt)  V  hJn(yt) 


(23) 


where  p(xt)  =  ee(^Xt\ 

Until  convergence  of  0 n ,  that  is,  when  |  |0n+i  —  1 12  < 

e,  for  a  certain  small  value  e. 

Set  0z+i  =  and  Hi+1  —  Hn. 


Until  ||0z+i  —  0z||2  <  e,  ||Efz+i  —  H1W2  <  e  or  l  =  /max-3 


3  In  the  experiments  we  use  /max  =  10 
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Remark  1.  The  PMM  and  R-PMM  algorithms  introduced 
respectively  in  [15]  and  [16]  are  particular  cases  of  the  pa¬ 
rameters  a  ( regularization )  and  a  (noise)  in  the  R-TPMM 
algorithm  .  Let  us  introduce  the  following  notation  for  the 
particular  cases  of  these  parameters: 

•  PMM:  a  =  0  and  a  =  0. 

•  R-PMM:  a  >  0  and  a  =  0. 

•  TP  MM:  a  =  0  and  a  >  0. 

•  R-TPMM: a  >  0  and  a  >  0. 

We  will  use  the  above  notation  in  the  experiments  in  Section 
5. 

Remark  2.  Notice  that  the  estimators  (22)- (23)  in  the 
PMM  and  R-PMM  approaches  (cr  =  0)  are  weighted 
harmonic  means  of  the  local  estimators  (2)-(3)  of  Levina- 
Bickel.  The  weight  at  each  point  is  the  probability  of 
the  membership  function,  h.  In  the  particular  case  of 
a  unique  class,  J  =  1,  we  obtain  the  global  dimen¬ 
sion  estimator  proposed  by  MacKay  and  Ghahramani 
( http://www.  inference.phy.  cam.  ac.  uk/mackay /dimension/), 
a  particular  case  of  our  proposed  framework. 

Remark  3.  We  are  using  the  same  level  of  noise  a  for  all 
the  clusters.  A  better  approach  might  be  to  use  a  a  suitable 
for  each  class.  Although  computationally  speaking  it  will 
be  more  demanding,  since  we  would  have  to  recompute  the 
local  estimators  m(xt)  and  0(xt)  at  each  iteration  with  the 
a  of  the  assigned  class.  Moreover,  the  different  a  would 
have  to  be  estimated  ( this  can  be  done  for  example  as  the 
value  of  a  which  minimizes  the  estimated  dimension  in  each 
class). 

As  proved  in  [2],  if  a  is  small  enough,  (18)  has  a  guar¬ 
anteed  global  maximum  for  a  fixed  value  of  and  the 
additional  term  S(H )  does  not  affect  the  convergence  of 
the  EM-type  algorithm.  It  can  be  shown  (see  Appendix  B) 
that,  for  the  case  of  Vr,  the  corresponding  bound  on  a  is 
aR  <  1/(2 k). 

Using  the  same  analysis  as  in  Section  3. 1  we  find  that  the 
relative  error  produced  in  (22)  by  using  the  approximation 
(10)  for  m(xt)  is 

A <  4.5cr2(mJ  —  1)  /  \ 

m°  minM  '  rnJ(a  =  0)J 

where  mj (cr  =  0)  is  (22)  with  a  =  0,  and  R^yt)™3 -1  = 

^Di  ( Vt  )• 


4.3  Asymptotic  analysis 

Levina  and  Bickel  show  in  [23]  that  under  the  assumptions 
T  — »  oo,  k  — >  oo,  and  k/T  — >  0,  that  is  when  the  Poisson 
approximation  is  correct,  the  mean  and  variance  of  the  di¬ 
mension  estimator  (2)  (with  k  —  2  instead  of  k  —  1  in  the 
denominator)  are 

m2 

E[m(xt)\  =  mT,  Var[ra(xt)]|  =  J  , 

where  mT  is  the  actual  dimension.  We  can  apply  the  same 
type  of  analysis  to  our  model  in  the  particular  case  of  hard 
clustering,  that  is 

1  if  j  =  ttrgmaxJ  /;.!  (y(), 

0  otherwise. 

We  assume,  in  addition,  that  all  the  points  that  belong  to 
class  j  are  well  classified.  Then,  we  obtain  the  following 
results 

t- 

Var[mJ]  =  «)!0  ( (t  _  _  4)  , 

where  mJT  is  the  correct  intrinsic  dimension  of  class  j  and 
Nj  is  the  amount  of  points  classified  as  class  j.  See  Ap¬ 
pendix  C  for  the  details  of  the  proof.  This  result  shows 
that  the  dimension  estimator  of  each  class  is  more  biased 
when  the  intrinsic  dimension  increases.  On  the  other  hand, 
when  there  are  more  points  in  a  class  (Nj  is  larger),  the  bias 
is  reduced.  It  is  reduced  also  by  considering  more  near¬ 
est  neighbors,  although  there  is  a  compromise  for  this  value 
since  increasing  k  affects  the  underlying  hypothesis  of  con¬ 
stant  density  inside  the  ball.  We  have  verified  this  result 
experimentally  and  we  found  that  the  bias  results  in  a  un¬ 
derestimation  of  the  intrinsic  dimension  (this  behavior  was 
also  observed  in  the  Levina-Bickel  estimator  [23]),  and  that 
it  depends  on  the  intrinsic  dimension  but  not  on  the  ambi¬ 
ent  dimension.  We  have  also  experimentally  observed  that  a 
possible  bias  in  the  estimated  dimension  does  not  affect  the 
clustering,  unless  the  bias  makes  the  estimated  dimension  of 
one  class  to  be  close  to  any  of  the  other  clusters  estimated 
dimensions. 

The  analysis  of  the  density  estimator  Qi  is  the  subject  of 
current  research,  as  it  is  the  study  of  the  asymptotic  behavior 
for  the  full  soft  clustering  model. 

5  Experimental  results 

We  now  present  experimental  results  with  synthetic  and  real 
data  for  the  proposed  R-TPMM  and  its  variants.  We  also 
compare  some  of  the  results  with  the  ones  obtained  with 


GPCA  [35]  and  the  Souvenir  and  Pless  [31]  algorithms.  We 
fixed  a  and  a  experimentally.  For  a  we  usually  use  val¬ 
ues  in  the  interval  [0,  5].  As  for  the  case  of  a  we  use  a 
value  in  the  order  of  the  mean  distance  to  the  first  neighbor: 
a  =  vR\,  where  R\  =  ^^2tRi(xt)  and  0  <  v  <  1. 
In  the  experiments  with  real  data  -  digits,  faces,  video  ac¬ 
tivities,  and  motion  -  we  use  the  following  values  for  v\ 
0.4, 0.4,  0.25,  and  1  respectively.  In  the  first  (artificial  data) 
experiment,  since  we  know  the  level  of  noise  in  the  point 
coordinates,  we  use  the  estimated  a  as  computed  in  Ap¬ 
pendix  A.  The  only  parameter  in  GPCA  is  the  number  of 
clusters.  In  the  Souvenir-Pless  algorithm  the  input  parame¬ 
ters  are  the  number  of  nearest  neighbors  and  the  dimension 
of  each  cluster.  We  also  fixed  these  parameters  experimen¬ 
tally  in  order  to  obtain  the  best  classification  results. 


5.1  Synthetic  data 

First,  we  work  with  a  point  cloud  formed  by  300  samples  of 
a  spiral  and  800  of  a  plane,  both  in  3D  embedding  space.  We 
compare  the  following  algorithms:  PMM,  R-PMM,  TPMM, 
R-TPMM,  GPCA,  and  Souvenir-Pless.  Figure  1  shows,  for 
each  algorithm,  the  point  cloud  with  each  point  colored  and 
marked  differently  according  to  its  classification.  In  the  dif¬ 
ferent  versions  of  our  proposed  algorithm  we  set  k  =  30, 
J  =  2,  a  =  0.75,  and  a  =  0.1.  We  test  TPMM  and  R- 
TPMM  with  a  small  value  of  cr  different  than  zero  even  if 
there  is  no  noise  just  to  show  that  a  small  error  in  the  esti¬ 
mation  of  a  does  not  significantly  affect  the  result.  Notice 
that  the  regularized  versions  of  our  proposed  algorithm  im¬ 
prove  the  classification  at  the  edges.  In  the  Souvenir-Pless 
algorithm  we  use  k  =  10  and  dimensions  2  and  2  (it  gives 
a  better  result  than  using  the  actual  dimensions,  2  and  1,  as 
parameters).  The  GPCA  algorithm  does  not  give  good  re¬ 
sults  because  it  is  designed  for  linear  manifolds.  Table  1 
contains  quantitative  results  of  the  different  versions  of  our 
algorithm.  Our  approach  gives  some  errors  at  the  intersec¬ 
tion  of  the  two  manifolds.  This  is  due  to  the  fact  that  a  point 
in  the  intersection  has  points  of  the  other  manifold  as  some 
of  its  nearest  neighbors.  Thus,  the  extent  of  the  classifica¬ 
tion  errors  in  the  intersection  depends  on  the  amount  of  k 
nearest  neighbors  considered.  Note  that  the  GPCA  handles 
intersections  well  when  working  with  linear  manifolds  [18]. 
Addressing  this  problem  is  part  of  the  ongoing  efforts  in 
the  extensions  of  the  proposed  stratification  learning  frame¬ 
work. 

Next,  we  added  Gaussian  noise,  with  standard  deviation 
0.66,  to  the  point  coordinates.  Then,  if  we  approximate  the 
transition  density  with  a  Gaussian  (see  Appendix  A),  we  use 
the  estimated  standard  deviation  a  =  0. 66\/2  =  0.93.  The 
rest  of  the  parameters  we  use  are  k  =  40 ,  J  =  2,  a  =  1, 
and  for  Souvenir-Pless,  k  =  20  and  dimensions  2  and  2. 
The  qualitative  comparison  of  the  different  algorithms  can 


PMM  j| 

1  R-PMM  j| 

|  TPMM  | 

|  R-TPMM 

E 

estimated  parameters  for  each  class 

Cl 

C2 

Cl 

C2 

Cl 

C2 

Cl 

C2 

m 

1.90 

1.02 

1.90 

0.99 

1.87 

1.03 

1.87 

1.00 

e 

1.01 

1.10 

0.99 

1.14 

1.05 

1.09 

1.02 

1.12 

Number  of  points  in  each  class 

pi. 

787 

13 

800 

0 

788 

12 

800 

0 

Sp. 

21 

279 

22 

278 

21 

279 

23 

277 

Table  1:  Estimated  parameters,  dimension  m  and  density 
p  =  ee,  in  each  class  ( Cl  and  C2),  and  clustering  results  of 
a  plane  and  a  spiral  (denoted  by  Pl.  and  Sp.  respectively). 
The  four  algorithms  use  k  =  30  and  J  =  2. 


be  seen  in  Figure  2.  Again,  notice  how  the  classification 
of  the  points  at  the  edges  is  better  in  the  regularized  ver¬ 
sions.  Table  2  contains  the  quantitative  results  for  the  dif¬ 
ferent  variants  of  the  proposed  algorithm.  In  particular,  it 
can  be  seen  that  the  translated  versions  give  an  estimation 
for  the  dimension  m  less  sensible  to  noise. 


(d)  R-TPMM  (e)  GPCA  (f)  Souvenir-Pless 

Figure  1 :  Clustering  of  a  spiral  and  a  plane.  Results  with 
different  algorithms  ( this  is  a  color  figure). 


Due  to  the  statistical  nature  of  the  R-TPMM  approach, 
it  is  not  restricted  to  linear  manifolds,  such  as  GPCA  [36], 
nor  to  Euclidean  manifolds,  such  as  Isomap  [33],  on  which 
the  Souvenir-Pless  technique  [31]  is  based  on.  This  is  man¬ 
ifested  in  Figure  3,  where  we  cluster  a  sphere  and  a  curved 
line  and  compare  the  results  of  the  R-TPMM,  GPCA  and 
Souvenir-Pless.  The  R-TPMM  gives  a  100%  accurate  clus¬ 
tering  and  the  estimated  dimensions  are  0.98  for  the  line  and 
2.05  for  the  sphere. 

Robustness  to  outliers  has  been  studied  for  clustering  lin¬ 
ear  manifolds.  In  [37]  a  robust  GPCA  algorithm  is  pro¬ 
posed.  A  segmentation  of  linear  subspaces  based  on  infor¬ 
mation  theory  is  proposed  in  [25],  and  it  has  been  shown 
to  be  robust  to  outliers.  In  order  to  see  how  the  R-TPMM 
performs  in  the  presence  of  outliers,  we  have  added  outliers 
to  a  set  of  points  sampling  a  spiral  and  a  plane.  The  original 
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c r 

i  fT 

cr 

(a)  PMM 

(b)  R-PMM 

(c)  TPMM 

cr 

cr 

or 

(d)  R-TPMM 

(e)  GPCA 

(f)  Souvenir-Pies s 

Figure  2:  Clustering  of  a  spiral  and  a  plane  with  noise. 
Results  with  different  algorithms  ( this  is  a  color  figure ). 


PMM  || 

||  R-PMM  | 

|  TPMM  | 

|  R-TPMM 

E 

estimated  parameters  for  each  class 

Cl 

C2 

Cl 

C2 

Cl 

C2 

Cl 

C2 

m 

2.47 

1.51 

2.48 

1.43 

1.86 

1.35 

1.87 

1.32 

0 

0.13 

0.03 

0.15 

0.03 

0.87 

0.34 

0.83 

0.40 

Number  of  points  in  each  class 

pi. 

764 

36 

800 

0 

784 

16 

800 

0 

Sp. 

22 

278 

25 

275 

27 

273 

29 

271 

Table  2:  Estimated  parameters,  dimension  m  and  density 
p  =  ee ,  in  each  class  ( Cl  and  C2 ),  and  clustering  results 
of  a  plane  and  a  spiral  with  noise  (denoted  by  PL  and  Sp. 
respectively).  The  four  algorithms  use  k  =  40  and  J  =  2. 


point  coordinates  are  within  the  intervals  [—11,21],  [5,  25], 
and  [—11, 14].  The  outliers  follow  a  uniform  distribution 
within  the  intervals  [—30, 30],  [—15, 35],  and  [—30, 30].  We 
use  a  =  0.1  and  a  =  0.1.  Figure  4  shows  the  classifica¬ 
tion  results  for  different  amounts  of  outliers  and  number  of 
classes:  a)  J  =  2,  2.5%  outliers;  b)  J  =  3,  2.5%  outliers; 
c)  J  =  3,  50%  outliers;  d)  J  =  3,  50%  outliers;  e)  J  =  3, 
75%  outliers;  f)  J  =  3,  100%  outliers;.  In  the  experiment 
(a)  we  set  two  classes  and  we  obtain  a  class  formed  by  the 
spiral  and  the  outliers  with  an  estimated  dimension  of  1.10, 
the  second  class  is  the  plane  with  an  estimated  dimension 
of  1.87.  Note  that  the  estimation  of  the  embedding  dimen¬ 
sions  are  not  affected  by  the  outliers  when  its  percentage  is 
relatively  low  and  we  do  not  allow  an  extra  class  for  the  out¬ 
liers.  If  we  set  three  classes,  experiments  (b)-(f),  the  outliers 
are  clustered  as  a  separate  class,  and  the  only  errors  being 
at  the  intersections  (due  to  points  whose  nearest  neighbors 
actually  belong  to  different  classes).  The  dimensions  ob¬ 
tained  in  each  experiment  are  the  following:  (b)  1.06,  1.87, 
and  10.46;  (c)  1.17,  1.88,  and  3.32;  (d)  1.23,  1.87  and  2.92; 
(e)  1.29,  1.83  and  2.97;  (f)  1.28,  1.81  and  2.92.  Indepen¬ 
dently  of  the  amount  of  outliers,  the  algorithm  is  able  to 


(a)  R-TPMM  (b)  GPCA  (c)  Souvenir-Pies s 

Figure  3:  Clustering  of  a  sphere  and  a  curve  with  R-TPMM 
(k  =  20,  J  =  2,  a  =  0.1  and  a  =  0),  GPCA  and  Souvenir- 
Pless  (k  =  20,  dimensions  2  and  1 ).  R-TPMM  works  well 
in  non-Euclidean  manifolds.  The  estimated  dimensions  for 
each  cluster  in  (a)  are  0.98  and  2.05  for  clusters  in  green 
and  red  respectively  ( this  is  a  color  figure ). 


cluster  apart  the  spiral  and  the  plane,  and  the  correct  esti¬ 
mation  of  their  embedding  dimensions  is  not  affected  by  the 
outliers.  When  the  amount  of  outliers  is  small,  2.5%  in  (c), 
the  estimated  dimension  for  the  class  ‘outliers’  is  very  large 
due  to  the  small  amount  of  points  belonging  to  this  class, 
and  we  do  not  have  enough  samples  of  the  class  ‘outlier’ 
in  each  ball  (there  are  mixed  samples  from  the  spiral  and/or 
the  plane).  In  these  balls,  the  assumption  of  approximate 
constant  density  is  not  satisfied  either.  When  the  amount  of 
outliers  is  larger,  their  estimated  dimension  is  the  same  as 
the  ambient  dimension,  since  there  is  no  intrinsic  structure 
for  these  points. 


(a)  2.5%  outliers  (b)  2.5%  outliers 


(c)  25%  outliers 


(d)  50%  outliers 


(e)  75%  outliers  (f)  100%  outliers 


Figure  4:  R-TPMM  clustering  of  a  spiral  and  a  plane  with 
different  amount  of  added  outliers,  k  =  30,  a  =  0.1  and 
a  =  0.1.  Example  (a)  is  when  considering  two  classes  and 
the  rest  with  three  classes  (this  is  a  color  figure ). 


The  experiment  in  Figure  5  illustrates  how  the  soft  clus¬ 
tering  is  done  according  to  both  dimensionality  and  density. 
The  data  consists  of  2000  points  on  the  Swiss  roll,  400  on 
a  line  with  high  density  and  50  on  another  less  dense  line. 
We  have  set  J  =  3  and  the  algorithm  clusters  the  line  in 
two  different  classes,  according  to  the  different  densities. 
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The  estimated  dimensions  are:  1.98,  1.02  and  0.99.  And 
the  estimated  densities:  0.49,  0.53  and  6.89  respectively. 


Figure  5 :  Clustering  of  a  Swiss  roll  and  a  line  with  two 
different  densities  with  R-TPMM,  k  =  20,  J  =  3,  a  =  2 
and  <j  =  0 ,  no  noise  (this  is  a  color  figure ). 

Before  presenting  results  on  real  data,  we  show  how  the 
regularization  parameter  a  affects  the  classification.  Figure 
6  shows  the  evolution,  according  to  a ,  of  the  classification 
of  the  spiral  and  the  plane  by  the  R-PMM  with  Vr.  We 
have  perturbed  50  randomly  picked  samples  of  the  spiral 
by  a  Gaussian  noise  of  standard  deviation  a  =  0.66.  It 
can  be  observed  that  a  small  amount  of  regularization  helps 
in  the  classification,  but  when  a  increases,  it  produces  a 
larger  diffusion  of  the  labelling,  resulting  in  an  inaccurate 
classification.  This  is  due  to  the  fact  that  the  regularization 
component  gains  more  importance  than  the  log-likelihood 
term.  In  the  limit,  when  a  is  quite  large,  the  optimal  so¬ 
lution  is  a  single  class.  Of  course,  the  “optimal”  value  of 
a ,  which  marks  these  transitions,  depend  on  each  particular 
experiment  (this  is  common  in  MRF-type  approaches,  and 
the  study  of  techniques  from  there  to  automatically  compute 
this  regularization  parameter  is  an  interesting  open  prob¬ 
lem).  For  the  experiments  in  this  paper,  we  have  worked 
with  values  of  a  within  the  interval  [0,  5].  In  [16]  we  also 
tested  the  evolution  of  the  classification  with  respect  to  a 
with  another  term  T>c  in  the  additional  constraint  S(H)  so 
as  to  impose  spatial  compactness  within  a  class.  Again,  the 
dimension/density  criterion  for  classification  was  more  pe¬ 
nalized  against  the  extra  term  S(H)  for  larger  values  of  a 
and  thus  yielding  rather  a  k-means  kind  of  clustering  (note 
that  within  the  context  of  GPCA,  [24]  also  proposed  a  com¬ 
bination  of  k-means  and  dimensionality  clustering). 


5.2  Real  data 

As  a  test  of  the  performance  with  real  data,  we  first  work 
with  the  MNIST  database  of  handwritten  digits,4  which  has 
a  test  set  of  10,000  examples.  Each  digit  is  an  image  of 
28  x  28  pixels  and  we  treat  the  data  as  784-dimensional  vec¬ 
tors.  We  analyze  the  mixture  of  digits  one  and  two,  some 
examples  of  those  scanned  digits  as  well  as  the  clustering 


(d)  OL  =  8  (e)  a  =  20  (f)  a  =  25 

Figure  6:  Clustering  of  a  ID  spiral  and  a  2D  plane  (R- 
PMM,  k  =  30,  J  =  2).  Evolution  of  the  classification  as 
the  regularization  parameter  a  increases. 


results  are  in  Figure  7.  Observe  how  the  classification  im¬ 
proves  adding  regularization  and  including  the  noise  in  the 
model  (Translated  Poisson).  We  have  used  R-PMM  with 
a  =  4,  TPMM  with  a  =  1.5,  and  R-TPMM  with  a  =  1 
and  a  =  1.5.  Levina-Bickel’s  technique  gives  a  dimension 
value  of  11.26  and  Costa-Hero’s  9.  These  methods  give  a 
dimension  in  between  the  two  different  dimensions  present 
in  the  point  cloud.  With  the  R-TPMM  algorithm  (and  its 
variants),  we  are  able  to  separate  the  points  (images)  corre¬ 
sponding  to  each  digit  and  handle  the  noise  and  regulariza¬ 
tion.  Both  sets  of  digits  have  different  dimensionality  and 
density.  We  have  observed  that  some  other  digits  do  have 
the  same  dimensionality,  as  expected.  Observe  in  the  Table 
of  Fig.  7  how  the  dimension  is  reduced  with  the  (R-)TPMM, 
these  values  are  much  closer  (than  the  ones  with  (R-)PMM) 
to  the  dimension  obtained  with  Isomap,  see  graph  in  Figure 
8,  applied  to  each  one  of  the  digits  by  separate.  The  fact  that 
the  dimension  is  reduced  when  considering  the  translated 
process  indicates  that  the  high  dimensions  were  originally 
due  to  the  noise  (this  can  be  also  inferred  by  observing  the 
Isomap  eigenvalues  in  Fig. 8). 


Isomap  dimensionality 


Isomap  dimensionality 


(a)  Digit  T’ 


(b)  Digit  ’2’ 


Figure  8:  Isomap  dimensionality  of  Digits  one  and  two.  The 
graph  shows  the  residual  variance  of  the  first  ten  Isomap 
embedding  dimensions. 


4http://yann.lecun.com/exdb/mnist/ 
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(I  PMM  | 

|  R-PMM  | 

|  TPMM  [ 

|  R-TPMM  | 

|  Estimated  parameters  for  each  class  | 

Cl 

C2 

Cl 

C2 

Cl 

C2 

Cl 

C2 

m 

7.33 

12.79 

7.36 

12.95 

2.86 

7.14 

2.88 

7.24 

0 

-7.38 

-23.99 

-7.67 

-23.26 

-1.52 

-12.70 

-1.62 

-12.90 

|  Number  of  points  in  each  class  | 

‘V 

1032 

0 

1032 

0 

1032 

0 

1029 

3 

‘2’ 

70 

1065 

43 

1092 

36 

1099 

17 

1118 

Figure  7:  Clustering  of  scanned  digits  ‘1 7  and  ‘ 2 .’  Some  examples  of  digits  and  table  with  estimated  parameters ,  dimension 
m  and  density  p  =  ee ,  for  each  class  (Cl  and  C2),  and  clustering  results  for  different  variants  of  the  R-TPMM  algorithm 
with  J= 2,  A;  =  30  (since  the  density  is  p  =  ee ,  p  >  0  for  0  E  IR). 


We  also  analyze  images  from  the  Yale  Face  Database 
B,5  which  contains  images  of  10  subjects  under  585  view¬ 
ing  conditions  (9  poses  and  65  illumination  conditions),  see 
Fig.  9.  Each  image  has  a  size  of  640  x  480  pixels.  For  com¬ 
putational  reasons  we  subsampled  the  images  by  a  factor  of 
ten  and  use  each  64  x  48  image  as  a  vector  in  a  high  di¬ 
mensional  space.  We  analyze  the  point  cloud  formed  by  the 
585  images  of  subject  5  (varying  pose  and  illumination)  to¬ 
gether  with  the  65  images  of  subject  6  only  in  the  first  pose 
and  under  varying  illuminations.  The  estimated  dimensions 
and  confusion  matrices  using  the  PMM  and  R-TPMM  al¬ 
gorithm  with  a  =  0.25  and  a  =  1  are  presented  in  Table 
3.  Note  how  both  subjects  are  well  separated,  and  the  set  of 
images  of  subject  5  has  a  dimension  one  unity  larger  than 
the  dimension  for  subject  6,  since  we  do  not  consider  the 
pose  variation  for  this  subject.  The  classification  results  are 
improved  using  regularization  and  the  translated  Poisson 
model.  Observe  also  that  changing  the  number  of  k  nearest 
neighbors  does  not  significantly  change  the  results.  Table  4 
contains  the  confusion  matrix  obtained  with  the  GPCA  and 
the  Souvenir-Pless  algorithms.  These  algorithms  are  com¬ 
puted  with  a  pre-projection  of  the  data  onto  a  5 -dimensional 
space6.  This  is  necessary  in  the  GPCA  because,  although 
not  being  an  iterative  algorithm,  it  consumes  a  lot  of  time  in 
high  dimensional  spaces.  For  the  Souvenir-Pless  algorithm 
this  point  is  not  so  critical  but  we  obtained  better  classifica¬ 
tion  results  in  the  reduced  dimensionality  space.  However, 
with  the  proposed  R-TPMM  we  obtain  better  results  in  the 
original  space. 

It  must  be  clarified  that  the  R-TPMM  is  able  to  sepa¬ 
rate  both  subjects  because  their  corresponding  images  lie  in 
manifolds  of  different  dimensions.  However,  if  we  consider, 
for  example,  a  fixed  pose  under  varying  illuminations,  in 
both  subjects,  all  the  points  are  classified  in  the  same  class 
since  both  manifolds  have  the  same  dimension  (complex¬ 
ity).  In  this  particular  case,  we  tested  the  GPCA  algorithm 
and  it  gives  a  100%  accurate  classification. 

5http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html 

6 We  compute  the  SVD  of  the  matrix  data  /  =  UTiVt  and  consider 
the  matrix  formed  by  the  first  5  columns  of  VT  as  the  embedded  data.  In 
GPCA  we  further  use  homogeneous  coordinates  [36] 


PMM 

R-TPMM 

k  =  35  ||  k  =  50 

k  =  35  ||  k  =  50 

|  Estimated  dimension  for  each  class  | 

Cl 

C2 

Cl 

C2 

Cl 

C2 

Cl 

C2 

m 

4.10 

2.94 

4.37 

2.79 

3.34 

2.59 

3.60 

2.55 

|  Number  of  points  in  each  class  | 

S.  5 

S.  6 

569 

0 

16 

65 

575 

0 

10 

65 

584 

0 

1 

65 

584 

0 

1 

65 

Table  3:  Dimension  mfor  each  class  ( Cl  and  C2)  and  clus¬ 
tering  results  of  the  mixture  of  subject  5  ( all  poses ,  all  il¬ 
luminations)  and  subject  6  (one  pose ,  all  illuminations )  in 
the  Yale  Face  Database  B.  PMM  and  R-TPMM  (a  =  0.25, 
(7  =  1)  algorithms  with  two  different  values  of  k.  The  algo¬ 
rithms  are  applied  in  the  64  x  48  dimensional  space. 


|  GPCA  j 

Souvenir-Pless 

Number  of  points  in  each  class  | 

Cl 

C2 

Cl 

C2 

Subject  5 

325 

260 

476 

109 

Subject  6 

0 

65 

20 

45 

Table  4:  Clustering  results  of  the  mixture  of  subject  5  (all 
poses,  all  illuminations)  and  subject  6  (one  pose,  all  illumi¬ 
nations)  in  the  Yale  Face  Database  B.  We  apply  GPCA  and 
Souvenir-Pless  algorithms  to  the  data  pre-projected  onto  a 
5  dimensional  space. 
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Figure  9:  Examples  of  images  of  subjects  5  and  6  of  the  Yale 
Face  Database  B.  See  results  in  Table  3. 


The  R-TPMM  framework  is  also  tested  to  study  differ¬ 
ent  human  activities  in  video.  We  created  a  point  cloud  with 
the  frames  of  a  video  of  a  person  performing  four  different 
activities:  Standing,  walking,  jumping,  and  arms  waving, 
all  performed  in  a  static  background.  Each  original  frame  is 
480  x  640,  sub-sampled  to  48  x  64  pixels,  with  1673  frames 
(see  some  frame  examples  in  Figure  10).  This  is  mainly  to 
speed  up  computations.  In  video  applications,  one  may  be 
interested  in  temporal  regularization.  For  that,  we  consider 
a  temporal  neighborhood  in  Ur,  more  concretely  we  take 
into  account  the  6  previous  and  6  posterior  frames  in  the 
regularization  term.  The  confusion  matrix  with  the  classifi¬ 
cation  results  using  the  R-TPMM  algorithm  (with  k  =  10, 
a  =  5  and  a  =  0.25)  is  presented  in  Table  5.  The  error  in 
the  classification  affects  only  4%  of  the  frames. 


Figure  10:  Four  sample  frames  of  human  activities  in  video. 


Number  of  samples  in  each  cluster 

Cl 

C2 

C3 

C4 

Standing 

505 

0 

6 

0 

Walking 

0 

464 

45 

14 

Waving 

1 

0 

430 

0 

Jumping 

0 

0 

0 

207 

Table  5:  Classifying  human  activities  in  video  with  the  R- 
TPMM  algorithm  (k  =  10,  a  =  40  and  a  =  0.25).  We 
use  the  6  previous  and  6  posterior  frames  as  neighbors  in 
Ur,  which  results  in  a  temporal  regularization.  The  global 
classification  is  96%  accurate. 


Finally,  we  tested  the  R-TPMM  algorithm  in  a  mo¬ 
tion  segmentation  application.  We  use  a  sequence  of  the 
Kanatani  Lab  7 ,  see  some  examples  of  frames  in  Figure  11. 
This  sequence  was  originally  used  in  [19]  and  then  in  [36]. 
The  data  consists  of  the  2D  projection  coordinates  of  the 
trajectories  along  the  sequence  of  some  interest  points.  The 
sequence  that  we  analyze  corresponds  to  a  car  moving  in 
a  parking  lot  and  there  are  two  different  motions  in  the  se¬ 
quence.  As  in  [36]  we  pre-project  the  data,  originally  in 
a  60-dimensional  space  (2  coordinates  x  30  frames),  onto 
a  5-dimensional  space.  In  Table  6  we  show  the  classifica¬ 
tion  effectiveness  for  different  methods:  Costeira-Kanade, 
Ichimura,  Kanatani- Sugay a  (the  three  of  them  reported  in 
[19]),  Souvenir-Pless,  GPCA  and  R-TPMM.  For  the  R- 
TPMM  we  use  k  =  10,  a  =  2  and  cr  =  0.05.  We  also 
tested  our  algorithm  with  the  other  two  sequences  used  in 
[19,  36]  and  obtained  a  single  class  since  the  two  differ¬ 
ent  motions  have  the  same  dimension  (complexity).  Thus, 
it  is  necessary  to  introduce  an  additional  constraint  in  the 
R-TPMM  approach  in  order  to  deal  with  these  cases. 


Figure  11:  Two  frames  of  a  sequence  of  the  motion  segmen¬ 
tation  database  of  the  Kanatani  Laboratory. 


Method 

Effectiveness 

Costeira-Kanade 

60.3% 

Ichimura 

92.6% 

Kanatani-Sugaya 

100% 

Souvenir-Pless 

93.38% 

GPCA 

100% 

RTPMM 

100% 

Table  6:  Classification  rates,  using  different  methods,  for 
the  motion  segmentation  in  the  Kanatani  Laboratory  se¬ 
quence  (see  example  frames  in  Fig.  11). 

Regarding  the  computational  time,  the  most  expensive 
part  is  the  kNN-graph.  In  the  digits  experiment  (Fig.  7), 
2167  points  of  dimension  784,  the  execution  takes  18.37s 
while  10.29s  of  the  total  time  is  spent  in  the  computation 
of  the  kNN-graph.  For  the  experiment  with  the  Yale  faces 
(Fig.  9,  650  points  of  dimension  3072)  the  execution  time 
is  7.64s  (3.70s  for  the  kNN-graph).  In  the  video  experiment 

7http://www.suri.it.okayama-u.ac.jp/data.html 
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(Fig.  10,  1673  points  of  dimension  3072)  the  total  time 
and  the  kNN-graph  time  are,  respectively,  29.78s  and  24.87 
(CPU:  Pentium  Core  2  Duo,  2.0  GHz,  2.0  GB  memory). 


6  Conclusions 


and  Xj  (resp.  Xi  and  Xj).  We  can  write  D^  as  a  function  of 
the  original  points  Xi  and  Xj : 


Dij  =\\xi  -Xj\\2 

=  (D^  +  || rii-  rij\\l  +  2  <(xt  -  Xj),  (rii  -  nj)>)1/ 2  . 


In  this  paper  we  developed  a  framework  for  the  simulta¬ 
neous  and  regularized/constrained  estimation  of  the  intrin¬ 
sic  dimensionality  and  density  of  high  dimensional  noisy 
point  cloud  data  sampled  from  a  stratification,  as  the  basis 
for  complexity /density  based  soft-clustering.  The  algorithm 
is  based  on  a  statistical  model  which  addresses  the  pres¬ 
ence  of  noise  in  the  measurements.  Our  previous  related 
works  [15,  16]  are  particular  cases  of  the  R-TPMM  algo¬ 
rithm  introduced  in  this  paper.  We  showed  that  regulariza¬ 
tion  constraints  can  be  naturally  introduced  in  this  approach. 
The  experiments  showed  the  importance  of  incorporating 
the  noise  in  the  model  and  also  of  adding  regularization  in 
the  classification.  We  also  showed  that  the  algorithm  is  ro¬ 
bust  to  outliers.  With  the  proper  dissimilarity  function  and 
neighborhood  type,  we  are  able  to  add  spatial  or  temporal 
regularity  in  the  classification  or  intra-class  spatial  compact¬ 
ness.  Other  type  of  constraints  are  possible  under  the  same 
proposed  framework.  Asymptotic  theoretical  results  were 
also  presented. 

We  would  like  to  follow  this  direction  of  work  and  study 
other  constraints  which  can  be  useful  for  stratification  learn¬ 
ing.  One  possibility  is  to  define  a  dissimilarity  function 
which  leads  to  separate  different  manifolds  that  share  the 
same  dimensionality  and  density.  This  will  define  a  new 
constraint  that  will  also  help  in  the  classification  process 
when  there  is  an  intersection  of  two  manifolds  (and  where 
the  algorithm  fails  at  the  present  stage).  Since  the  den¬ 
sity  depends  on  the  dimension,  we  are  intrinsically  giving 
more  importance  to  the  dimension  criterion  in  our  frame¬ 
work.  The  control  of  the  relative  importance  of  these  two 
criteria  needs  also  to  be  addressed. 


Appendix  A:  Estimation  of  the  distri¬ 
bution  of  distance  errors 

In  this  section  we  derive  the  distribution  of  the  error  in  the 
distance  between  a  pair  of  points  when  this  distance  is  com¬ 
puted  from  noisy  points.  We  are  interested  in  the  particular 
case  when  the  noise  follows  an  i.i.d.  Gaussian  distribution 
in  each  of  the  point  coordinates. 

Let  A  =  {xt  G  =  1, . . . , T}  and  X  =  {xt  G 

t  =  1, . . . ,  T}  be  two  point  clouds  which  are  related  in 
the  following  way:  xt  =  xt  +  nt,  for  each  index  t ,  where 
nt  ~  A(0,  a2),  i.e.,  A  is  a  noisy  version  of  X.  Let 
(resp.  Dij)  be  the  Euclidean  distance  between  points  xi 


Expanding  the  previous  expression  in  a  Taylor  series  around 
D^  (considering  the  rest  of  the  terms  sufficiently  small),  we 
obtain, 


Dij  r^Dij 


<(xi  —  Xj),  (rii  —  rij)>  || rii  —  rij 


'3  112 


Di 


2Di 


i  (<(xi  -  Xj),(m  -  Uj)>)  3 

'I - W, - +°M 

=Dij  +  Dni  +  Dn2  +  Dns  +  0(a3). 


In  order  to  estimate  the  probability  density  function  of  the 
three  error  terms  Dni ,  i  =  1 ...  3,  in  D^  we  make  use  of 
the  following  properties: 

1.  If  X  ~  N(n ,  a2)  and  a,  b  G  M,  then  aX+b  ~  N(afi+ 
b ,  (aa)2). 

2.  If  X  ~  N(iix,cf2x)  and  Y  ~  A(//y,ay)  are  inde¬ 
pendent  variables,  then: 

(a)  X  +  Y  ~  N(fix  +  M y  5  & x  +  a y )> 

(b)  X  —  Y  ~  N(iix  ~  Vy,  cj2x  +  4). 

3.  If  Xi, . . . ,  Xp  are  iid  variables  s.t.  Xi  ~  N 4,  a2), 
then  U  =  Yli= i  f  Xi~^%  j  follows  a  Chi-square  dis¬ 
tribution  with  p  degrees  of  freedom,  U~Xp- 

4.  If  A  is  a  random  variable  with  probability  density 
function  f(x)  and  Y  =  a  A,  where  a  G  R,  then,  the 
probability  density  function  of  Y  is  y^/(f  )• 

5.  The  probability  density  function  of  the  sum  of  two  in¬ 
dependent  random  variables  A  and  Y  with  probability 
density  functions  /  and  g  is  the  convolution 


(f*g)(x)  =  j 


f(y)g(x  -  y)  dy. 


The  error  term  Dni  ~  A(0,  2cr2),  by  using  proper¬ 
ties  1,  2(a)  and  2(b)  (notice  that  the  denominator  cancels 
out  the  weights  in  the  numerator  when  adding  the  indi¬ 
vidual  (constant)  variances  in  each  coordinate).  The  sec- 
ond  term,  D„2  ~  \2p  =  ^pf  \2  (A;e)  (properties  2(b) 
and  3).  And  for  the  last  term,  using  properties  1-4, 


Dn 


Xi  = 


32  D 


(-+H 
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Finally,  using  the  previous  results  and  Property  5,  we  can 
write 

Dij  «  Dij  +  W;  where  W  ~  N(0,  2 a2)  *  x2  *  Xi- 


Substituting  the  last  expression  with  values  in  (24)  gives 


<  2a  =  2ak , 


In  Figure  12  we  show  the  distribution  7V(0,  2a2)  com¬ 
pared  with  the  estimated  distribution  VFfora  =  0.5,p  =  3 
and  two  different  values  for  DtJ :  1.0  and  3.0.  As  we  can 
see  in  this  Figure,  for  a  fixed  a,  as  Dzj  gets  larger,  the  dis¬ 
tribution  W  is  closer  to  a  7V(0,  2cr2)  distribution.  Then,  for 
values  not  very  small,  that  is,  for  sufficient  SNR,  we  can 
approximate  the  probability  density  function  of  the  error  in 
the  distance  as  a  Gaussian. 


where  k  is  the  number  of  neighbors  in  the  regularization 
term.  Since  h\  G  [0, 1],  H  is  strictly  negative,  i.e.,  every 


eigenvalue  A  <  0,  if 
a  <  l/(2fc). 


<  1,  and  this  is  true  for 


Appendix  C:  Proof  of  the  asymptotic 
analysis 


Figure  12:  Gaussian  distribution,  N( 0,  2a2)  with  a  =  0.5, 
compared  to  the  estimated  distribution  W  for  two  different 
values  of  DZJ :  1.0  and  3.0. 


When  we  consider  the  particular  case  of  hard  clustering  we 
have 


h3(yt)  = 


1  if  j  =  argmax,/?,*  (?;,), 

0  otherwise. 


The  estimator  of  the  dimension  in  class  j  can  be  expressed 
as 


mJ 


T 


1 

k  —  1 


k- 1 


Elog 


Rk(vt) 

Ri{yt) 


(25) 


where  Nj  is  the  number  of  points  clustered  in  class  j  and 


log-Ri(yt) 


f0R  /(-Rj(yt)|r)logrrfr 

f0R  f(Ri(Vt)\r)dr 


(26) 


In  the  (R-)PMM  approach  we  have  Rz  =  Rz.  We  can 

Appendix  B:  Bound  on  a  for  conver-  rewrite  (25)  as 

gence  mj  =  Nj(k-i)mirz~1,  (27) 


We  now  show  that,  for  a  fixed  ip,  defined  in  (18) 

has  a  global  maximum.  For  that,  we  follow  the  same  lines 
as  in  [2].  Let  us  call  F^(H)  the  functional  (18)  when  i/j  is 
fixed.  F^  ( H )  has  a  global  maximum  if  it  is  strictly  concave, 
i.e.  if  its  Hessian  matrix  H ,  with  components 

—  1  /h{  if  i  =  j  and  l  =  t 

2a  if  i  =  j  and  l  ~  t  , 

0  otherwise, 

(24) 

is  strictly  negative.  The  Gerschgorin-Hadamard  Theorem 
tell  us  that  the  eigenvalues  A  of  this  Hessian  matrix  belong 
to  the  union  of  discs  indexed  by  (j,  t)  and  defined  by 

I A  —  Kjtjtl  E 


Hu. 


jt 


52F 

ShjShl 


where  mJT  is  the  actual  dimension  of  class  j  and  Z  is 


z  =  J2slY*; 

t=  1 


k-1 

Yt=  mT  E log 

7=1 


Rkjyt ) 
Ri{yt) ' 


With  the  proper  definition  of  the  upper  limit  R'  in  the  inte¬ 
gral  in  (26)  and  the  transition  density  f(Ri\r)  when  Rz  is 
close  to  R we  can  guarantee  that  Ri  <  Rk  (always  true  in 
(R-)PMM).  In  this  case,  we  use  the  fact  that  ( Ri/Rk)m°T 
is  distributed,  under  the  Poisson  assumption,  as  a  Uni- 
form(0,l)  distribution,  the  —  log  of  such  a  distribution  is 
an  Exponential  1),  and  then,  the  sum  of  (fc  —  1)  Exponen¬ 
tial  1)  distributed  variables  is  a  Gamma(fc  —  1,1).  Then, 
Yt  ~  Gamma(fc  —  1,1)  and  the  sum  of  Nj  Gamma(&  —  1,1) 
distributions  gives  Z  ~  Gamma(A^(A:  —  1),1)  and  ~ 
Inverse-Gamma((A;  —  The  expectation  of  Z-1  is 
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1  /  (Nj  (k  —  1)  —  1),  and  substituting  in  (27),  considering  that 
1  <  Nj(k  —  1),  yields 


E[mP]  =  mJT  + 


mJT 


Regarding  the  variance, 


Var  [mj]  =  N?(k  —  l)2Var  [Z  1], 


where 

Var[z  1]  =  (¥rFl¥rF2)' 


We  now  define 

2  —  5Nj(k  —  1) 

a:“  Nj(k  —  l)2(A^(fc  —  1)  —  2)  ’ 

After  simple  computations  and  under  the  hypothesis  that 
|a|  <  1,  we  obtain 


Var[mJ]  = 


(mr)2 


- 1)  -  2 


i+E«" 

n=l 


and  since  the  second  term  is  smaller  than  the  first  one,  we 
can  write 


VarK]  =  i«  fO  („j(t  _*!)_■;)  - 
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